4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1765
Conversation
I have one question and one concern.
- What is the desired behavior when orphan mitigation fails? In that case, are we going to consider the instance dead and unusable? Or are we going to open the instance up for request retries and processing spec updates?
- Retryable requests will restart the OperationStartTime when the orphan mitigation completes. This means that retryable requests involving orphan mitigation have the potential to retry indefinitely.
if shouldMitigateOrphan {
    // TODO nilebox: if failedCond == nil, we lose the original error reason/message
    // in the status completely there. Should we keep the original reason/message instead?
I propose that we add a new OrphanMitigation condition that is true while the instance is undergoing orphan mitigation. The Ready condition would not be changed when starting or stopping orphan mitigation; presumably, the Ready condition will already be false when orphan mitigation starts. Although I'd also be fine with relying solely on OrphanMitigationInProgress instead of adding a new condition. Either way, I suggest that we stop changing the Ready condition with the starting-orphan-mitigation reason. Of course, this is all follow-on work.
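A rough sketch of what that could look like, reusing the existing setServiceInstanceCondition helper. The OrphanMitigation condition type and the reason/message strings below are hypothetical and are not part of the current v1beta1 API:

// Hypothetical: a new condition type that would have to be added to the v1beta1 API.
const ServiceInstanceConditionOrphanMitigation v1beta1.ServiceInstanceConditionType = "OrphanMitigation"

// When orphan mitigation starts, flip the new condition on without touching Ready.
setServiceInstanceCondition(instance, ServiceInstanceConditionOrphanMitigation,
    v1beta1.ConditionTrue, "OrphanMitigationStarted", "Orphan mitigation is in progress")

// When orphan mitigation finishes, flip it back off.
setServiceInstanceCondition(instance, ServiceInstanceConditionOrphanMitigation,
    v1beta1.ConditionFalse, "OrphanMitigationSucceeded", "Orphan mitigation completed successfully")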
@staebler I personally don't have any objection to an OrphanMitigation condition; we can discuss this separately with @pmorie and others.
As you suggest (and as my TODO suggests too), I will keep the original reason/message for now, and probably leave a TODO noting that it would be nice to reflect the started orphan mitigation in the status somehow.
Raised an issue #1771 to discuss this.
setServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionFailed, failedCond.Status, failedCond.Reason, failedCond.Message)
    errorMessage = fmt.Errorf(failedCond.Message)
} else {
    resetServiceInstanceCondition(instance, v1beta1.ServiceInstanceConditionFailed)
Why is resetting the Failed condition part of processing a failure? The Failed condition should be cleared out as part of starting a new provision rather than as part of finishing a provision. As it is, the Failed condition is not cleared out when the provision is successful.
The failedCond == nil here means that we have encountered a retriable error (see the processTemporaryProvisionFailure method).
The Failed:True condition along with the latest ObservedGeneration means: "terminal failure that doesn't need retrying until the spec is updated (and consequently the Generation is incremented)".
As a reminder from #1748:
+// isServiceInstanceProcessedAlready returns true if there is no further processing
+// needed for the instance based on ObservedGeneration
+func isServiceInstanceProcessedAlready(instance *v1beta1.ServiceInstance) bool {
+ return instance.Status.ObservedGeneration >= instance.Generation &&
+ (isServiceInstanceReady(instance) || isServiceInstanceFailed(instance)) && !instance.Status.OrphanMitigationInProgress
+}
isServiceInstanceProcessedAlready is invoked before we start a new provisioning operation, and it will return true unless we reset the Failed condition as I do there.
I don't like that this also means that we lose the reason and message there, but currently we store the same reason and message in the Ready condition, so it's probably not too bad.
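To make the interplay concrete, here is a simplified sketch of where that check sits in the provisioning reconciliation path (the function name and surrounding code are paraphrased, not copied from the PR):

// Paraphrased sketch: the short-circuit at the top of instance reconciliation.
func (c *controller) reconcileServiceInstanceAdd(instance *v1beta1.ServiceInstance) error {
    if isServiceInstanceProcessedAlready(instance) {
        // Failed=True (or Ready=True) with ObservedGeneration >= Generation means
        // there is nothing left to do until the spec, and hence the Generation, changes.
        return nil
    }
    // Otherwise a new provision request is sent to the broker; for a retriable
    // failure the Failed condition must therefore be reset, or this path is never reached.
    // ...
    return nil
}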
pkg/controller/controller.go (Outdated)
    is5XX
}

// isTerminalHttpStatus returns whether an error with the given HTTP status
nit: s/isTerminalHttpStatus/isRetriableHTTPStatus
// Need to vary return error depending on whether the worker should
// requeue this resource.
if failedCond == nil || shouldMitigateOrphan {
    return errorMessage
This is not strictly required, as we have updated the status above, so the instance will come back to the queue anyway.
Keeping it as is, since it was written that way before.
Actually this is needed, but the comment is misleading. We need to prevent resetting the rate limiter.
/cc @kibbles-n-bytes this is another place where we claim that we want to requeue the resource when what we really need is to prevent invoking the Forget() method.
I will update the comment.
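For context, a minimal sketch of the standard client-go rate-limited worker loop this relies on (simplified; the actual controller wiring differs), assuming k8s.io/client-go/util/workqueue:

// Simplified worker loop: a nil error from the sync handler calls Forget(),
// resetting the per-key backoff; a non-nil error keeps the backoff and requeues.
func runWorker(queue workqueue.RateLimitingInterface, sync func(key string) error) {
    for {
        key, shutdown := queue.Get()
        if shutdown {
            return
        }
        func() {
            defer queue.Done(key)
            if err := sync(key.(string)); err != nil {
                // Keep the rate limiter's backoff for this key and retry later.
                queue.AddRateLimited(key)
                return
            }
            // Success: reset the backoff for this key.
            queue.Forget(key)
        }()
    }
}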
@staebler You mean when we reached the retry timeout? The spec says that the platform SHOULD (i.e. not MUST) keep trying until it succeeds (nothing about time limits or max number of retries):
Also I love this OSB spec paragraph:
It sounds like "best effort but no guarantees whatsoever". Just brilliant. We can discuss this at the next SIG call.
@staebler Yes. Is this bad? That's how Kubernetes works - it keeps trying to deploy a Pod even if the Docker image doesn't exist in DockerHub. There might be some bug on the broker side leading to such failures.
I think there should be some monitoring tool that would clean up all the instances that have been stuck in provisioning for too long and/or notify users (owners) about this issue (i.e. OpenShift and Atlassian will probably have something in-house in that area). But that's out of scope of Service Catalog.
If such behavior turns out to be painful, we can fix it later by introducing some retry limits probably.
Also need to check whether the
Service catalog specifically has the reconciliation retry duration that determines how long reconciliation of a given generation should be attempted before giving up. Let's say that the provision request failed repeatedly with 408 Request Timeout errors. By default, the controller would stop attempting to send provision requests to the broker after 7 days. At that point, the controller would set the instance to
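Roughly, the cutoff being described looks like the following (an illustrative sketch only; the exact field and option names in the controller may differ, and the surrounding variables are assumed to be in scope):

// Illustrative sketch: once a retriable operation has been running longer than the
// reconciliation retry duration, stop retrying and record a terminal failure.
if instance.Status.OperationStartTime != nil &&
    time.Now().After(instance.Status.OperationStartTime.Add(c.reconciliationRetryDuration)) {
    // Treat the repeated retriable failures as terminal for this generation.
    return c.processProvisionFailure(instance, readyCond, failedCond, shouldMitigateOrphan)
}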
@nilebox Yes, that looks to be the only (non-constraint-breaking) reason why an orphan mitigation would fail. I think that we have some inconsistencies in how we are treating broker resources if we allow further provision requests after a failed orphan mitigation. If a pure delete fails (ie, exceeds reconciliation retry duration), then the instance is kept around so that the user may take manual actions to negotiate with the broker to clean up resources. For a failed orphan mitigation, we would not be giving the user the same opportunity for manual intervention.
Ok, I see the problem now. The easiest solution then is not to invoke the
@staebler I think we should keep the instance in the "unusable" state if orphan mitigation retry limit is exceeded. This would allow for manual intervention:
To be clear, I understand that we'll need #1771 or some other improvement to make sure that we will not do any provisioning until the orphan mitigation has succeeded.
@staebler @nilebox Perhaps this is a situation in which it'd be beneficial to use the
Would complicate the logic of the controller slightly, but I think it'd get us the desired behavior.
@kibbles-n-bytes your solution won't work in the case when we exhausted the retry timeout, but then the spec was updated (leading to
@kibbles-n-bytes @staebler how about adding a new operation type
Combined with the
Actually nevermind, I lost the context of discussion :(
/retest
@staebler @kibbles-n-bytes rebased on the latest master after the other PR was merged
@staebler we don't do that anymore (merged in the other PR) - we don't invoke
/retest
pkg/controller/controller.go (Outdated)
// isRetriableHTTPStatus returns whether an error with the given HTTP status
// code is retriable.
func isRetriableHTTPStatus(statusCode int) bool {
    return statusCode != 400
FYI there are constants for HTTP status codes in the http package.
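For instance, the check above could use the constant directly (same behavior, just more self-documenting; assumes a net/http import):

// Same check as above, written with the net/http constant instead of the literal 400.
func isRetriableHTTPStatus(statusCode int) bool {
    return statusCode != http.StatusBadRequest
}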
Leaving comments for myself
// Need to vary return error depending on whether the worker should
// requeue this resource.
if failedCond == nil || shouldMitigateOrphan {
    return errorMessage
Actually this is needed, but the comment is misleading. We need to prevent resetting the rate limiter.
/cc @kibbles-n-bytes this is another place where we claim that we want to requeue the resource when what we really need is to prevent invoking the Forget() method.
I will update the comment.
@@ -1662,12 +1681,14 @@ func (c *controller) processUpdateServiceInstanceSuccess(instance *v1beta1.Servi
// processUpdateServiceInstanceFailure handles the logging and updating of a
// ServiceInstance that hit a terminal failure during update reconciliation.
func (c *controller) processUpdateServiceInstanceFailure(instance *v1beta1.ServiceInstance, readyCond, failedCond *v1beta1.ServiceInstanceCondition) error {
    // TODO nilebox: We need to distinguish terminal and temporary errors there
    // but we need to merge https://github.com/kubernetes-incubator/service-catalog/pull/1748
#1748 is merged, so we can fix this TODO
Extra comments on PR to help reviewers
// Don't reset the current operation if the error is retriable
// or requires an orphan mitigation.
// Only reset the OSB operation status
clearServiceInstanceAsyncOsbOperation(instance)
@kibbles-n-bytes note that I am reusing your newly introduced method there and in the update failure handling too.
Please double check that this is the correct behavior.
The goal here is to keep the OperationStartTime untouched to avoid retrying indefinitely.
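Roughly, the distinction being relied on here (a paraphrased sketch of the helper, not the exact code from the PR; the status field names are assumed from the v1beta1 API):

// Paraphrased sketch: reset only the async OSB operation bookkeeping so that
// OperationStartTime survives and the reconciliation retry clock keeps running.
func clearServiceInstanceAsyncOsbOperation(instance *v1beta1.ServiceInstance) {
    instance.Status.AsyncOpInProgress = false
    instance.Status.LastOperation = nil
    // instance.Status.OperationStartTime is intentionally left untouched.
}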
the idea behind this seems ok to me
@nilebox As a consequence of this, it seems like we'll only be trying the deprovision once in the case of a final orphan mitigation after a reconciliation retry duration timeout during a provision. For now, I think it's fine, though we should re-look at it in the future.
// But we still need to return a non-nil error for retriable errors and
// orphan mitigation to avoid resetting the rate limiter.
if failedCond == nil || shouldMitigateOrphan {
    return errorMessage
See the comment above that clarifies why we need to return a non-nil error here.
Retries seem like a good idea to me, and my understanding is that they don't violate the OSB spec. Code looks fine too.
LGTM
// Don't reset the current operation if the error is retriable
// or requires an orphan mitigation.
// Only reset the OSB operation status
clearServiceInstanceAsyncOsbOperation(instance)
the idea behind this seems ok to me
LGTM. RIP everyone wishing to fix the resources after they've fat-fingered a parameter, though. 💀 Making
Retriggering Jenkins just for peace of mind. If it's green, please merge.
Let's not introduce a regression here - people already have issues with recovering from a change to a bad parameter state. Can you describe the situation you're talking about?
@kibbles-n-bytes Nope. Orphan mitigation will be retried for a week (by default). And if the max retry duration is exceeded, we will stop retrying. But after the spec is updated, it will trigger orphan mitigation first, followed by reconciling a new spec (only after/if orphan mitigation has succeeded). If you disagree with this behavior, or think that the code behaves differently, please comment.
@pmorie just to be clear here: currently people have no way of recovering from a change to a bad parameter state at all.
So this alone is a huge improvement, and there cannot be a regression for behavior that was missing before. On the other hand, and that's what @kibbles-n-bytes is referring to, the retry loop also makes #1755 more obviously annoying for the non-happy path: the spec will be locked for a week if the OSB broker continuously returns an error on provisioning (any error code except
I definitely agree that we need to fix #1755 ASAP, and that's why I have been pushing for it for weeks now.
@kibbles-n-bytes ah sorry, you mean the ones that are already marked with the terminal
Let's be clear though, we could remove the lock for the "same user" to unblock most people and not have any concern about OSB API spec violations. That would be an easy fix (I think) that could probably be done (by someone who knows the code) in short order.
@duglin what I meant to say is that current behavior is unacceptable. One way or another, we need to fix that ASAP to whatever we all agree to.
yes but my point is that the urgency of this can be GREATLY diminished if we would just unlock it for the same user. That would give us time to make the right design decision (for kube and osbapi compliance) and not make a rushed one. I doubt this would take more than a few lines of code to do - it should just require a modification to the lock check to add something like
@@ -761,14 +769,12 @@ func (c *controller) pollServiceInstance(instance *v1beta1.ServiceInstance) erro
reason := errorProvisionCallFailedReason
message := "Provision call failed: " + description
readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, message)
failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, message)
err = c.processProvisionFailure(instance, readyCond, failedCond, false)
err = c.processTemporaryProvisionFailure(instance, readyCond, false)
This introduced a bug: if the polling operation returns state=failed, in combination with the change on this line: https://github.com/kubernetes-incubator/service-catalog/blob/e15b73719911853d3755b71f3d8b26b21296d0a3/pkg/controller/controller_instance.go#L1652
it causes a failed provision after polling to never attempt deprovision.
@n3wscott apologies, this is because I thought that for async operations we never need to perform an orphan mitigation.
If what @kibbles-n-bytes wrote about bindings in openservicebrokerapi/servicebroker#334 (comment) is true about instances too, then we could just change the flag to true there:
err = c.processTemporaryProvisionFailure(instance, readyCond, true)
i.e. perform orphan mitigation immediately instead of waiting for deletion of the instance.
Thoughts?
@n3wscott also, can you create an issue to track this?
I had: #1879
Fixes #1715
This is the final, third part (item 1) of the TODO list in #1715 (comment).
Item 2: #1751
Item 3: #1748